AITopics | head pose

OMNIGAZE: Reward-inspired Generalizable Gaze Estimation in the Wild

Neural Information Processing Systems, Jun-20-2026, 06:46:29 GMT

Current 3D gaze estimation methods struggle to generalize across diverse data domains, primarily due to i) the scarcity of annotated datasets, and ii) the insufficient diversity of labeled data. In this work, we present OMNIGAZE, a semi-supervised framework for 3D gaze estimation, which utilizes large-scale unlabeled data collected from diverse and unconstrained real-world environments to mitigate domain bias and generalize gaze estimation in the wild. First, we build a diverse collection of unlabeled facial images, varying in facial appearances, background environments, illumination conditions, head poses, and eye occlusions. In order to leverage unlabeled data spanning a broader distribution, OMNIGAZE adopts a standard pseudo-labeling strategy and devises a reward model to assess the reliability of pseudo labels. Beyond pseudo labels as 3D direction vectors, the reward model also incorporates visual embeddings extracted by an off-the-shelf visual encoder and semantic cues from gaze perspective generated by prompting a Multimodal Large Language Model to compute confidence scores. Then, these scores are utilized to select high-quality pseudo labels and weight them for loss computation. Extensive experiments demonstrate that OMNIGAZE achieves state-of-the-art performance on five datasets under both in-domain and cross-domain settings. Furthermore, we also evaluate the efficacy of OMNIGAZE as a scalable data engine for gaze estimation, which exhibits robust zero-shot generalization on four unseen datasets.
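
The selection-and-weighting step described above can be sketched roughly as follows. This is a minimal illustration, not the OMNIGAZE implementation: the function names, the threshold of 0.5, the angular-error loss, and the normalization by total confidence are all assumptions made for the example, and the confidence scores are taken as given rather than produced by a reward model.

```python
import math

def angular_error(pred, target):
    """Angle in radians between two 3D gaze direction vectors."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in target))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot / norm)))

def weighted_pseudo_label_loss(preds, pseudo_labels, confidences, threshold=0.5):
    """Confidence-weighted mean angular error over reliable pseudo labels.

    Pseudo labels with confidence below `threshold` (a hypothetical value)
    are discarded; the rest contribute in proportion to their confidence.
    Returns 0.0 if no pseudo label passes the threshold.
    """
    total, weight_sum = 0.0, 0.0
    for pred, label, conf in zip(preds, pseudo_labels, confidences):
        if conf < threshold:
            continue  # treat as unreliable and skip
        total += conf * angular_error(pred, label)
        weight_sum += conf
    return total / weight_sum if weight_sum > 0 else 0.0
```

In this sketch a low-confidence pseudo label contributes nothing to the loss, while a retained one pulls the model toward it in proportion to its score, which is one plausible reading of "select high-quality pseudo labels and weight them".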

large language model, machine learning, natural language, (20 more...)

Country: Asia > China (0.28)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.87)

Industry:

Health & Medicine > Therapeutic Area (0.67)
Information Technology > Security & Privacy (0.46)
Education > Educational Setting > Online (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision > Face Recognition (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Unsupervised or Indirectly Supervised Learning (0.70)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)

Visual Cues Enhance Predictive Turn-Taking for Two-Party Human Interaction

Russell, Sam O'Connor, Harte, Naomi

arXiv.org Artificial Intelligence, Oct-27-2025

Turn-taking is richly multimodal. Predictive turn-taking models (PTTMs) facilitate naturalistic human-robot interaction, yet most rely solely on speech. We introduce MM-VAP, a multimodal PTTM which combines speech with visual cues including facial expression, head pose and gaze. We find that it outperforms the state-of-the-art audio-only model in videoconferencing interactions (84% vs. 79% hold/shift prediction accuracy). Unlike prior work, which aggregates all holds and shifts, we group by duration of silence between turns. This reveals that through the inclusion of visual features, MM-VAP outperforms a state-of-the-art audio-only turn-taking model across all durations of speaker transitions. We conduct a detailed ablation study, which reveals that facial expression features contribute the most to model performance. Thus, our working hypothesis is that when interlocutors can see one another, visual cues are vital for turn-taking and must therefore be included for accurate turn-taking prediction. We additionally validate the suitability of automatic speech alignment for PTTM training using telephone speech. This work represents the first comprehensive analysis of multimodal PTTMs. We discuss implications for future work and make all code publicly available.
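
Grouping hold/shift accuracy by silence duration, rather than aggregating all events, might look like the sketch below. The bin edges, the event tuple layout, and the function name are hypothetical choices for illustration; they are not taken from the MM-VAP code.

```python
from collections import defaultdict

def accuracy_by_silence(events, bin_edges=(0.25, 0.5, 1.0)):
    """Hold/shift prediction accuracy per silence-duration bin.

    events: iterable of (silence_seconds, true_label, predicted_label),
    where labels are "hold" or "shift". bin_edges are assumed, ascending
    upper bounds in seconds. Returns {bin_name: accuracy}.
    """
    def bin_name(silence):
        lower = 0.0
        for edge in bin_edges:
            if silence < edge:
                return f"{lower:g}-{edge:g}s"
            lower = edge
        return f">={lower:g}s"

    correct = defaultdict(int)
    total = defaultdict(int)
    for silence, truth, pred in events:
        name = bin_name(silence)
        total[name] += 1
        correct[name] += int(truth == pred)
    return {name: correct[name] / total[name] for name in total}
```

Reporting one accuracy per bin makes it visible whether a model's advantage holds for short gaps as well as long ones, which a single aggregate figure would hide.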

machine learning, natural language, skantze, (19 more...)

doi: 10.18653/v1/2025.findings-acl.12

2505.21043

Country:

Europe (0.28)
North America > United States (0.28)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
Information Technology > Artificial Intelligence > Vision > Face Recognition (0.70)

DEMO: Disentangled Motion Latent Flow Matching for Fine-Grained Controllable Talking Portrait Synthesis

Chen, Peiyin, Yang, Zhuowei, Feng, Hui, Jiang, Sheng, Yan, Rui

arXiv.org Artificial Intelligence, Oct-14-2025

Audio-driven talking-head generation has advanced rapidly with diffusion-based generative models, yet producing temporally coherent videos with fine-grained motion control remains challenging. We propose DEMO, a flow-matching generative framework for audio-driven talking-portrait video synthesis that delivers disentangled, high-fidelity control of lip motion, head pose, and eye gaze. The core contribution is a motion auto-encoder that builds a structured latent space in which motion factors are independently represented and approximately orthogonalized. On this disentangled motion space, we apply optimal-transport-based flow matching with a transformer predictor to generate temporally smooth motion trajectories conditioned on audio. Extensive experiments across multiple benchmarks show that DEMO outperforms prior methods in video realism, lip-audio synchronization, and motion fidelity. These results demonstrate that combining fine-grained motion disentanglement with flow-based generative modeling provides a powerful new paradigm for controllable talking-head video synthesis.
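
For orientation, the generic conditional flow-matching objective with straight-line (optimal-transport) paths can be written as below. This is the standard textbook form, not necessarily the exact loss used in DEMO; the symbols are notation assumed here: z_0 is a noise sample, z_1 a target motion latent, a the audio condition, and v_θ the learned velocity predictor.

```latex
\begin{aligned}
z_t &= (1 - t)\, z_0 + t\, z_1, \qquad t \sim \mathcal{U}[0, 1],\; z_0 \sim \mathcal{N}(0, I), \\
\mathcal{L}_{\mathrm{FM}}(\theta) &= \mathbb{E}_{t,\, z_0,\, z_1}\,
  \bigl\| v_\theta(z_t, t, a) - (z_1 - z_0) \bigr\|_2^2 .
\end{aligned}
```

Because the target velocity z_1 - z_0 is constant along each path, sampling amounts to integrating v_θ from t = 0 to t = 1 starting at noise.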

artificial intelligence, machine learning, natural language, (15 more...)

2510.1065

Country: Asia > China > Jiangsu Province (0.14)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

014fe398da515cd552fa6e1f33e0565e-Paper-Conference.pdf

Neural Information Processing Systems, Oct-9-2025, 16:58:45 GMT

computer vision and pattern recognition, ieee cvf conference, video, (11 more...)

Country: Asia > Myanmar > Tanintharyi Region > Dawei (0.04)

Genre: Research Report > Experimental Study (0.93)

Industry:

Information Technology (0.67)
Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision > Face Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

WebEyeTrack: Scalable Eye-Tracking for the Browser via On-Device Few-Shot Personalization

Davalos, Eduardo, Zhang, Yike, Srivastava, Namrata, Thatigotla, Yashvitha, Salas, Jorge A., McFadden, Sara, Cho, Sun-Joo, Goodwin, Amanda, TS, Ashwin, Biswas, Gautam

arXiv.org Artificial Intelligence, Aug-28-2025

With advancements in AI, new gaze estimation methods are exceeding state-of-the-art (SOTA) benchmarks, but their real-world application reveals a gap with commercial eye-tracking solutions. Factors like model size, inference time, and privacy often go unaddressed. Meanwhile, webcam-based eye-tracking methods lack sufficient accuracy, in particular due to head movement. To tackle these issues, we introduce WebEyeTrack, a framework that integrates lightweight SOTA gaze estimation models directly in the browser. It incorporates model-based head pose estimation and on-device few-shot learning with as few as nine calibration samples (k ≤ 9). WebEyeTrack adapts to new users, achieving SOTA performance with an error margin of 2.32 cm on GazeCapture and real-time inference speeds of 2.4 milliseconds on an iPhone 14. Our open-source code is available at https://github.com/RedForestAi/WebEyeTrack.

artificial intelligence, estimation, machine learning, (19 more...)

2508.19544

Country: North America > United States (1.00)

Genre: Research Report (0.82)

Industry:

Information Technology (0.93)
Government > Regional Government (0.47)

Technology:

Information Technology > Human Computer Interaction > Interfaces (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)

Pose-Robust Calibration Strategy for Point-of-Gaze Estimation on Mobile Phones

Zhao, Yujie, Zeng, Jiabei, Shan, Shiguang

arXiv.org Artificial Intelligence, Aug-15-2025

Although appearance-based point-of-gaze (PoG) estimation has improved, the estimators still struggle to generalize across individuals due to personal differences. Therefore, person-specific calibration is required for accurate PoG estimation. However, calibrated PoG estimators are often sensitive to head pose variations. To address this, we investigate the key factors influencing calibrated estimators and explore pose-robust calibration strategies. Specifically, we first construct a benchmark, MobilePoG, which includes facial images from 32 individuals focusing on designated points under either fixed or continuously changing head poses. Using this benchmark, we systematically analyze how the diversity of calibration points and head poses influences estimation accuracy. Our experiments show that introducing a wider range of head poses during calibration improves the estimator's ability to handle pose variation. Building on this insight, we propose a dynamic calibration strategy in which users fixate on calibration points while moving their phones. This strategy naturally introduces head pose variation during a user-friendly and efficient calibration process, ultimately producing a better calibrated PoG estimator that is less sensitive to head pose variations than those using conventional calibration strategies. Codes and datasets are available at our project page.
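
To make the idea of person-specific calibration concrete, the sketch below fits a constant 2D offset from a handful of calibration samples and measures point-of-gaze error before and after. It is a deliberately simple stand-in, not the calibration method studied in the paper: a constant offset cannot model head pose dependence, and all function names and the centimeter units are assumptions for illustration.

```python
import math

def fit_offset(predicted, targets):
    """Mean 2D residual (target - predicted) over calibration samples."""
    n = len(predicted)
    dx = sum(t[0] - p[0] for p, t in zip(predicted, targets)) / n
    dy = sum(t[1] - p[1] for p, t in zip(predicted, targets)) / n
    return dx, dy

def apply_offset(point, offset):
    """Shift an uncalibrated point-of-gaze estimate by the fitted offset."""
    return point[0] + offset[0], point[1] + offset[1]

def mean_error(predicted, targets):
    """Mean Euclidean distance between estimates and ground-truth points."""
    return sum(math.dist(p, t) for p, t in zip(predicted, targets)) / len(predicted)
```

A typical use would fit the offset on the calibration points, apply it to held-out estimates, and compare `mean_error` before and after. If the residual changes with head pose, a single offset fitted under one pose will not transfer to another, which is the sensitivity the abstract describes.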

artificial intelligence, human computer interaction, machine learning, (18 more...)

2508.10268

Country: Asia > China (0.14)

Genre: Research Report (1.00)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Human Computer Interaction (1.00)
Information Technology > Communications > Mobile (1.00)
(3 more...)

Vision in Action: Learning Active Perception from Human Demonstrations

Xiong, Haoyu, Xu, Xiaomeng, Wu, Jimmy, Hou, Yifan, Bohg, Jeannette, Song, Shuran

arXiv.org Artificial Intelligence, Jun-19-2025

ViA learns task-relevant active perceptual strategies (e.g., searching, tracking, and focusing) directly from human demonstrations. On the hardware side, ViA employs a simple yet effective 6-DoF robotic neck to enable flexible, human-like head movements. To capture human active perception strategies, we design a VR-based teleoperation interface that creates a shared observation space between the robot and the human operator. To mitigate VR motion sickness caused by latency in the robot's physical movements, the interface uses an intermediate 3D scene representation, enabling real-time view rendering on the operator side while asynchronously updating the scene with the robot's latest observations. Together, these design elements enable the learning of robust visuomotor policies for three complex, multi-stage bimanual manipulation tasks involving visual occlusions, significantly outperforming baseline systems.

artificial intelligence, human computer interaction, robot, (16 more...)

2506.15666

Country: Europe (0.28)

Genre: Research Report (0.51)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Human Computer Interaction > Interfaces > Virtual Reality (0.68)

NeuGaze: Reshaping the future BCI

Yang, Yiqian

arXiv.org Artificial Intelligence, Apr-22-2025

Traditional brain-computer interfaces (BCIs), reliant on costly electroencephalography or invasive implants, struggle with complex human-computer interactions due to setup complexity and limited precision. We present NeuGaze, a novel webcam-based system that leverages eye gaze, head movements, and facial expressions to enable intuitive, real-time control using only a standard 30 Hz webcam, often pre-installed in laptops. Requiring minimal calibration, NeuGaze achieves performance comparable to conventional inputs, supporting precise cursor navigation, key triggering via an efficient skill wheel, and dynamic gaming interactions, such as defeating formidable opponents in first-person games. By harnessing preserved neck-up functionalities in motor-impaired individuals, NeuGaze eliminates the need for specialized hardware, offering a low-cost, accessible alternative to BCIs. This paradigm empowers diverse applications, from assistive technology to entertainment, redefining human-computer interaction for motor-impaired users. The project is available at https://github.com/NeuSpeech/NeuGaze.

artificial intelligence, human computer interaction, threshold, (17 more...)

2504.15101

Country:

North America > United States (0.46)
Asia > China (0.28)

Genre: Research Report (0.82)

Industry:

Health & Medicine > Therapeutic Area > Neurology (1.00)
Health & Medicine > Public Health (0.94)
Leisure & Entertainment > Games (0.93)
(2 more...)

Technology:

Information Technology > Human Computer Interaction > Interfaces (1.00)
Information Technology > Artificial Intelligence > Robots (0.69)
Information Technology > Artificial Intelligence > Vision > Face Recognition (0.50)
Information Technology > Artificial Intelligence > Cognitive Science > Neuroscience (0.36)

Learning Predictive Visuomotor Coordination

Jia, Wenqi, Lai, Bolin, Liu, Miao, Xu, Danfei, Rehg, James M.

arXiv.org Artificial Intelligence, Mar-29-2025

Understanding and predicting human visuomotor coordination is crucial for applications in robotics, human-computer interaction, and assistive technologies. This work introduces a forecasting-based task for visuomotor modeling, where the goal is to predict head pose, gaze, and upper-body motion from egocentric visual and kinematic observations. We propose a Visuomotor Coordination Representation (VCR) that learns structured temporal dependencies across these multimodal signals. We extend a diffusion-based motion modeling framework that integrates egocentric vision and kinematic sequences, enabling temporally coherent and accurate visuomotor predictions. Our approach is evaluated on the large-scale EgoExo4D dataset, demonstrating strong generalization across diverse real-world activities. Our results highlight the importance of multimodal integration in understanding visuomotor coordination, contributing to research in visuomotor learning and human behavior modeling.

artificial intelligence, deep learning, machine learning, (17 more...)

2503.233

Country: North America > United States > Illinois > Champaign County > Urbana (0.04)

Genre: Research Report > New Finding (0.66)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Vision (0.96)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

DAWN: Dynamic Frame Avatar with Non-autoregressive Diffusion Framework for Talking Head Video Generation

Cheng, Hanbo, Lin, Limin, Liu, Chenyu, Xia, Pengcheng, Hu, Pengfei, Ma, Jiefeng, Du, Jun, Pan, Jia

arXiv.org Artificial Intelligence, Oct-18-2024

Talking head generation intends to produce vivid and realistic talking head videos from a single portrait and speech audio clip. Although significant progress has been made in diffusion-based talking head generation, almost all methods rely on autoregressive strategies, which suffer from limited context utilization beyond the current generation step, error accumulation, and slower generation speed. To address these challenges, we present DAWN (Dynamic frame Avatar With Non-autoregressive diffusion), a framework that enables all-at-once generation of dynamic-length video sequences. Specifically, it consists of two main components: (1) audio-driven holistic facial dynamics generation in the latent motion space, and (2) audio-driven head pose and blink generation. Extensive experiments demonstrate that our method generates authentic and vivid videos with precise lip motions, and natural pose/blink movements. Additionally, with a high generation speed, DAWN possesses strong extrapolation capabilities, ensuring the stable production of high-quality long videos. Furthermore, we hope that DAWN sparks further exploration of non-autoregressive approaches in diffusion models.

Talking head generation aims at synthesizing a realistic and expressive talking head from a given portrait and audio clip, which is garnering growing interest due to its potential applications in virtual meetings, gaming, and film production. For talking head generation, it is essential that the lip motions in the generated video precisely match the accompanying speech, while maintaining high overall visual fidelity (Guo et al., 2021a). Furthermore, natural coordination between head pose, eye blinking, and the rhythm of the audio is also crucial for a convincing output (Liu et al., 2023).

artificial intelligence, machine learning, natural language, (21 more...)

2410.13726

Country:

North America > United States (0.04)
Europe > Greece > Attica > Athens (0.04)
Asia > Taiwan > Taiwan Province > Taipei (0.04)
(2 more...)

Genre: Research Report (1.00)

Industry: Media (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
